AIML Module Project - SUPERVISED LEARNING


Importing Required Python Modules and Libraries

Here we import all the libraries and modules needed for the whole project in a single cell.
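A single import cell for this kind of project might look like the following sketch (the exact library choices are assumptions; the notebook's actual cell may differ):

```python
# Core data handling and visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn utilities used throughout the project
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import warnings
warnings.filterwarnings('ignore')  # keep the notebook output clean
```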


Part ONE - Project Based


DOMAIN: Healthcare

CONTEXT: Medical research university X is conducting in-depth research on patients with certain conditions. The university has an internal AI team. Due to confidentiality, the patients' details and conditions are masked by the client, who provides separate datasets to the AI team for developing an AIML model that can predict the condition of a patient based on the received test results.

DATA DESCRIPTION: The data consists of biomechanics features of the patients according to their current conditions. Each patient is represented in the data set by six biomechanics attributes derived from the shape and orientation of the condition to their body part.

  1. P_incidence
  2. P_tilt
  3. L_angle
  4. S_slope
  5. P_radius
  6. S_degree
  7. Class

    PROJECT OBJECTIVE: Demonstrate the ability to fetch, process and leverage data to generate useful predictions by training Supervised Learning algorithms.

Steps and tasks:

  1. Import and warehouse data:
    • Import all the given datasets and explore shape and size of each.
    • Merge all datasets onto one and explore final shape and size.
  2. Data cleansing:
    • Explore and if required correct the datatypes of each attribute.
    • Explore for null values in the attributes and if required drop or impute values.
  3. Data analysis & visualisation:
    • Perform detailed statistical analysis on the data.
    • Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.
  4. Data pre-processing:
    • Segregate predictors vs target attributes.
    • Perform normalisation or scaling if required.
    • Check for target balancing. Add your comments.
    • Perform train-test split.
  5. Model training, testing and tuning:
    • Design and train a KNN classifier.
    • Display the classification accuracies for train and test data.
    • Display and explain the classification report in detail.
    • Automate the task of finding the best value of K for KNN.
    • Apply all the possible tuning techniques to train the best model for the given data. Select the final best trained model with your comments for selecting this model.
  6. Conclusion and improvisation:
    • Write your conclusion on the results.
    • Detailed suggestions or improvements on quality, quantity, variety, velocity, veracity etc. of the data points collected by the research team to perform a better data analysis in future.

1. Import and Warehouse Data:

* Import all the given Datasets and Explore Shape and Size of each.


Key Observations:-


* Merge all Datasets onto One and Explore Final Shape and Size.
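Assuming the masked data arrives as several files with identical columns, the merge step can be sketched with pandas (the tiny stand-in frames below are illustrative; in the project they would come from `pd.read_csv()` on each provided file):

```python
import pandas as pd

# Two small stand-in frames with the same columns
part_a = pd.DataFrame({'P_incidence': [63.0, 39.1], 'Class': ['Normal', 'Type_H']})
part_b = pd.DataFrame({'P_incidence': [68.8, 49.7], 'Class': ['Type_S', 'Normal']})

print(part_a.shape, part_b.shape)      # shape and size of each dataset

# Stack row-wise and rebuild a clean index
merged = pd.concat([part_a, part_b], ignore_index=True)
print(merged.shape)                    # final shape after merging
```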


Key Observations:-


2. Data Cleansing:

* Explore and If Required Correct the Datatypes of Each Attribute.


Key Observations:-


* Explore for Null Values in the Attributes and If Required Drop or Impute values.

Comments:-


Key Observations:-


3. Data Analysis & Visualisation:

* Perform Detailed Statistical Analysis on the Data.

Brief Summary of Data

Summary Based on Dependent Variable

Checking skewness of the data attributes

Getting Interquartile Range of data attributes

Checking Covariance related with all independent attributes

Checking Correlation by plotting Heatmap for all independent attributes
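The correlation check can be sketched as follows (synthetic data here; the project applies this to the six biomechanics attributes, and the heatmap call is noted in a comment):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=['P_incidence', 'P_tilt', 'L_angle'])
df['S_slope'] = df['P_incidence'] - df['P_tilt']  # induce a known relation

corr = df.corr()          # pairwise Pearson correlation matrix
print(corr.round(2))

# Visual version, e.g.:
# import seaborn as sns; sns.heatmap(corr, annot=True, cmap='coolwarm')
```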


Key Observations:-


* Perform a detailed Univariate, Bivariate and Multivariate Analysis with Appropriate detailed comments after Each Analysis.

Univariate Analysis

Univariate analysis is done on each attribute. Here we check the distribution of each attribute along with its five-point summary.
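The per-attribute check can be sketched like this (a synthetic series stands in for one attribute; the plotting calls are noted as comments):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
s = pd.Series(rng.normal(60, 15, size=200), name='P_incidence')

# Five-point summary: min, Q1, median, Q3, max
summary = s.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(summary)

# Distribution plots, e.g.:
# sns.histplot(s, kde=True) alongside sns.boxplot(x=s)
```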

Attribute 1: P_incidence


Key Observations:-


Attribute 2: P_tilt


Key Observations:-


Attribute 3: L_angle


Key Observations:-


Attribute 4: S_slope


Key Observations:-


Attribute 5: P_radius


Key Observations:-


Attribute 6: S_degree


Key Observations:-


Attribute 7: Class


Key Observations:-


Bivariate Analysis

Bivariate Analysis 1: Class VS P_incidence


Key Observations:-


Bivariate Analysis 2: Class VS P_tilt


Key Observations:-


Bivariate Analysis 3: Class VS L_angle


Key Observations:-


Bivariate Analysis 4: Class VS S_slope


Key Observations:-


Bivariate Analysis 5: Class VS P_radius


Key Observations:-


Bivariate Analysis 6: Class VS S_degree


Key Observations:-


Multivariate Analysis

Multivariate analysis is performed to understand interactions between different fields in the dataset.

Multivariate Analysis : To Check Relation Between Independent Attributes


Key Observations:-


Multivariate Analysis : To check Density of Categorical Attribute in all other Attributes


Key Observations:-


Multivariate Analysis : To Check Correlation


Key Observations:-


4. Data Pre-Processing:

Outlier Analysis

NOTE:- Here we replace outliers with the mean of the attribute computed without the outliers. That is, we first calculate the mean excluding the outliers, then replace each outlier with this calculated mean.
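A minimal sketch of that replacement, assuming the common 1.5×IQR fences as the outlier rule (the notebook may use a different rule):

```python
import pandas as pd

s = pd.Series([50.0, 52.0, 49.0, 51.0, 150.0])  # 150 is an obvious outlier

# IQR fences define which values count as outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Mean computed only over the non-outlier values
mask = s.between(lower, upper)
clean_mean = s[mask].mean()

# Replace every outlier with that mean
s_fixed = s.where(mask, clean_mean)
print(s_fixed.tolist())
```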


Key Observations:-


Encoding Categorical Attribute

We have "Class" as our Target Attribute with three different classes.

  1. Normal
  2. Type_H
  3. Type_S
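Encoding the three class labels as integers can be sketched with scikit-learn's `LabelEncoder` (assuming that is the encoder used; the notebook might instead map the labels manually):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

y = pd.Series(['Normal', 'Type_H', 'Type_S', 'Normal'])

le = LabelEncoder()
y_encoded = le.fit_transform(y)  # labels sorted alphabetically -> 0, 1, 2

print(dict(zip(le.classes_, range(len(le.classes_)))))
print(y_encoded)
```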

Key Observations:-


* Segregate Predictors VS Target Attributes

By separating the predictor and target attributes, we can perform further operations easily.


Key Observations:-


* Perform Normalisation or Scaling if Required.

Since our data is approximately normally distributed, we apply standardization (Z-score normalization) for feature scaling.

After standardizing the dataset, the data distribution has the following statistics:
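A minimal sketch of the standardization step with scikit-learn's `StandardScaler` (toy values stand in for the predictors):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[63.0, 22.6],
              [39.1, 10.1],
              [68.8, 22.2],
              [49.7,  9.7]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # z = (x - mean) / std, per column

# After z-scoring, each column has mean ~0 and standard deviation 1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```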


Key Observations:-


* Check for Target Balancing. Add your comments.


Key Observations:-


Fixing Target Imbalance by Synthetic Minority Oversampling Technique (SMOTE)

SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
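The description above can be sketched directly in NumPy. This is a toy illustration of the algorithm for a single synthetic point, not the `imblearn.over_sampling.SMOTE` implementation the notebook presumably uses:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sample(minority, k=3):
    """One synthetic point as described: pick a random minority instance a,
    find its k nearest minority neighbours, pick one (b) at random, and
    return a convex combination of a and b."""
    a = minority[rng.integers(len(minority))]
    # Distances from a to every minority instance (a itself is at distance 0)
    dists = np.linalg.norm(minority - a, axis=1)
    neighbours = minority[np.argsort(dists)[1:k + 1]]  # skip a itself
    b = neighbours[rng.integers(len(neighbours))]
    lam = rng.random()              # convex weight in [0, 1)
    return a + lam * (b - a)        # point on the line segment a-b

minority = rng.normal(size=(10, 2))
synthetic = smote_sample(minority)
print(synthetic)
```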


Key Observations:-


* Perform Train-Test Split.
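A sketch of the split with `train_test_split` (the 70/30 ratio and random seed are assumptions; stratifying preserves the class proportions in both splits):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # stand-in predictors
y = np.array([0, 1] * 10)          # balanced stand-in target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)
print(X_train.shape, X_test.shape)
```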


Key Observations:-


5. Model Training, Testing and Tuning:

* Design and Train a KNN Classifier.
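Training and scoring the KNN classifier can be sketched as follows (synthetic six-feature, three-class data mirrors the project's shape; the real notebook uses the patient data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # default K; tuned later
knn.fit(X_train, y_train)

train_acc = accuracy_score(y_train, knn.predict(X_train))
test_acc = accuracy_score(y_test, knn.predict(X_test))
print(round(train_acc, 3), round(test_acc, 3))
```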


Key Observations:-


* Display the Classification Accuracies for Train and Test Data.


Key Observations:-


* Display and Explain the Classification Report in detail.


Key Observations:-


Evaluating Performance of kNN Model


Key Observations:-


* Automate the Task of Finding Best Values of K for KNN.

Step 1 : Calculating Misclassification Errors for all Predicted Values, where the K-Value is Odd and Ranges from 1 to 50
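The search loop can be sketched like this (synthetic data again; the plot in the next step would come from the `errors` list):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Odd K values from 1 to 49; odd values avoid ties in majority voting
k_values = range(1, 50, 2)
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors.append(1 - knn.score(X_test, y_test))  # misclassification error

best_k = list(k_values)[int(np.argmin(errors))]
print(best_k)
# The error curve would then be plotted: plt.plot(list(k_values), errors)
```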


Key Observations:-


Step 2 : Plotting the Misclassification Error Values against K-Values to find Best Range of K-Value


Key Observations:-


Step 3 : Finding Best K-Value Within Range 1-20.


Key Observations:-


Building KNN Model with K = 3 to Check Accuracies


Key Observations:-


* Apply all the possible Tuning Techniques to Train the best Model for the given data. Select the final best Trained Model with your comments for selecting this Model.

List of different Models Applied to our Data:-

  1. LogisticRegression()
  2. KNeighborsClassifier()
  3. GaussianNB()
  4. SVC(kernel='rbf')
  5. SVC(kernel='linear')
  6. RandomForestClassifier()
  7. DecisionTreeClassifier()

Key Observations:-


6. Conclusion and Improvisation:

* Write your Conclusion on the Results.

* Detailed suggestions or improvements on quality, quantity, variety, velocity, veracity etc. of the data points collected by the research team to perform a better data analysis in future.

Closing Sentence:- The predictions generated by training supervised learning algorithms will help medical research university X understand patients with certain conditions.

-------------------------------------------------- End of Part ONE -------------------------------------------------------



Part TWO - Project Based


DOMAIN: Banking and Finance

CONTEXT: A bank X is on a massive digital transformation across all its departments. The bank has a growing customer base where the majority are liability customers (depositors) rather than borrowers (asset customers). The bank is interested in rapidly expanding its borrower base to bring in more business via loan interest. A campaign that the bank ran last quarter showed an average single-digit conversion rate. With digital transformation being the core strength of the business strategy, the marketing department wants to devise effective campaigns with better target marketing to increase the conversion ratio to double digits with the same budget as the last campaign.

DATA DESCRIPTION: The data consists of the following attributes:

  1. ID: Customer ID
  2. Age: Customer’s approximate age.
  3. CustomerSince: Customer of the bank since. [unit is masked]
  4. HighestSpend: Customer’s highest spend so far in one transaction. [unit is masked]
  5. ZipCode: Customer’s zip code.
  6. HiddenScore: A score associated to the customer which is masked by the bank as an IP.
  7. MonthlyAverageSpend: Customer’s monthly average spend so far. [unit is masked]
  8. Level: A level associated to the customer which is masked by the bank as an IP.
  9. Mortgage: Customer’s mortgage. [unit is masked]
  10. Security: Customer’s security asset with the bank. [unit is masked]
  11. FixedDepositAccount: Customer’s fixed deposit account with the bank. [unit is masked]
  12. InternetBanking: if the customer uses internet banking.
  13. CreditCard: if the customer uses bank’s credit card.
  14. LoanOnCard: if the customer has a loan on credit card.

    PROJECT OBJECTIVE: Build an AIML model to perform focused marketing by predicting the potential customers who will convert using the historical dataset.

Steps and tasks:

  1. Import and warehouse data:
    • Import all the given datasets and explore shape and size of each.
    • Merge all datasets onto one and explore final shape and size.
  2. Data cleansing:
    • Explore and if required correct the datatypes of each attribute.
    • Explore for null values in the attributes and if required drop or impute values.
  3. Data analysis & visualisation:
    • Perform detailed statistical analysis on the data.
    • Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.
  4. Data pre-processing:
    • Segregate predictors vs target attributes.
    • Check for target balancing and fix it if found imbalanced.
    • Perform train-test split.
  5. Model training, testing and tuning:
    • Design and train Logistic Regression and Naive Bayes classifiers.
    • Display the classification accuracies for train and test data.
    • Display and explain the classification report in detail.
    • Apply all the possible tuning techniques to train the best model for the given data. Select the final best trained model with your comments for selecting this model.
  6. Conclusion and improvisation:
    • Write your conclusion on the results.
    • Detailed suggestions or improvements on quality, quantity, variety, velocity, veracity etc. of the data points collected by the bank to perform a better data analysis in future.

1. Import and Warehouse Data:

* Import all the given Datasets and Explore Shape and Size of each.


Key Observations:-


* Merge all Datasets onto One and Explore final Shape and Size.


Key Observations:-


2. Data Cleansing:

* Explore and If Required Correct the Datatypes of each Attribute.


Key Observations:-


* Explore for Null Values in the Attributes and If Required Drop or Impute values.

Comments:-


Key Observations:-


Changing Datatype of LoanOnCard Attribute


Key Observations:-


3. Data Analysis & Visualisation:

* Perform Detailed Statistical Analysis on the Data.

Brief Summary of Data

Information about the Features

To preprocess the data, we first differentiate between the different types of attributes.

1. Qualitative Attributes:-

2. Quantitative Attributes:-

Dropping ID and ZipCode Attributes

Since these two attributes carry little information and are not helpful for further processing, we drop them.
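A minimal sketch of the drop (a toy frame stands in for the bank data; ID is a row identifier and ZipCode is near-unique per customer, so neither carries predictive signal):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'ZipCode': [91107, 90089],
                   'Age': [25, 45], 'HighestSpend': [49, 100]})

# Remove the identifier-like columns in place of the real dataset
df = df.drop(columns=['ID', 'ZipCode'])
print(df.columns.tolist())
```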


Cleaning Negative Values of CustomerSince Attribute

Step 1 : Getting information about Negative Values

Step 2 : Performing suitable operation on Negative values
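Both steps can be sketched as follows. Treating the negative sign as a data-entry error and taking the absolute value is an assumption here; the notebook may instead drop or impute those rows:

```python
import pandas as pd

s = pd.Series([3, -1, 10, -2, 7], name='CustomerSince')

# Step 1: inspect how many values are negative
neg_count = (s < 0).sum()
print(neg_count)

# Step 2: one plausible fix, treating the sign as a data-entry error
s_fixed = s.abs()
print((s_fixed < 0).sum())  # no negatives remain
```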

Checking skewness of the data attributes

Checking Covariance related with attributes

Checking Correlation by plotting Heatmap for attributes


Key Observations:-


* Perform a detailed Univariate, Bivariate and Multivariate Analysis with Appropriate detailed comments after Each Analysis.

Univariate Analysis

Univariate analysis is the simplest form of analyzing data. It involves only one variable.

Attribute 1: Age


Key Observations:-


Attribute 2: CustomerSince


Key Observations:-


Attribute 3: HighestSpend


Key Observations:-


Attribute 4: HiddenScore


Key Observations:-


Attribute 5: MonthlyAverageSpend


Key Observations:-


Attribute 6: Level


Key Observations:-


Attribute 7: Mortgage


Key Observations:-


Attribute 8: Security


Key Observations:-


Attribute 9: FixedDepositAccount


Key Observations:-


Attribute 10: InternetBanking


Key Observations:-


Attribute 11: CreditCard


Key Observations:-


Attribute 12: LoanOnCard


Key Observations:-


Outliers in Each Discrete Attribute

Since outliers in categorical data are not meaningful, we ignore them here.


Key Observations:-


Bivariate Analysis

Bivariate Analysis 1: Age VS All Categorical Attributes


Key Observations:-


Bivariate Analysis 2: CustomerSince VS All Categorical Attributes


Key Observations:-


Bivariate Analysis 3: HighestSpend VS All Categorical Attributes


Key Observations:-


Bivariate Analysis 4: MonthlyAverageSpend VS All Categorical Attributes


Key Observations:-


Bivariate Analysis 5: Mortgage VS All Categorical Attributes


Key Observations:-


Multivariate Analysis

Multivariate analysis is performed to understand interactions between different fields in the dataset.

Multivariate Analysis : To Check Relation Between Discrete Attributes


Key Observations:-


Multivariate Analysis : To check Density of Target Attribute in all other Discrete Features


Key Observations:-


Multivariate Analysis : To Check Correlation


Key Observations:-


4. Data Pre-Processing:

Outlier Analysis

NOTE:- Here we replace outliers with the mean of the attribute computed without the outliers. That is, we first calculate the mean excluding the outliers, then replace each outlier with this calculated mean.


Key Observations:-


Feature Scaling Standardization(Z-Score Normalization)

Scaling is needed for our data, so we scale all discrete features with Z-score normalization. After standardizing the dataset, the data distribution has the following statistics:


Key Observations:-


* Segregate Predictors VS Target Attributes

By separating the predictor and target attributes, we can perform further operations easily.


Key Observations:-


* Check for Target Balancing and Fix it if found Imbalanced.


Key Observations:-


Fixing Target Imbalance by Synthetic Minority Oversampling Technique (SMOTE)

SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.


Key Observations:-


* Perform Train-Test Split.


Key Observations:-


5. Model Training, Testing and Tuning:

* Design and Train a Logistic Regression and Naive Bayes Classifiers.

Logistic Regression

Naive Bayes Classifiers : Gaussian
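Training both classifiers side by side can be sketched like this (a synthetic binary target stands in for LoanOnCard):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Binary stand-in for the LoanOnCard target
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
gnb = GaussianNB().fit(X_train, y_train)

print('LogisticRegression:', round(accuracy_score(y_test, logreg.predict(X_test)), 3))
print('GaussianNB        :', round(accuracy_score(y_test, gnb.predict(X_test)), 3))
```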


Key Observations:-


* Display the Classification Accuracies for Train and Test Data.


Key Observations:-


* Display and Explain the Classification Report in detail.


Key Observations:-


Evaluating Performance of Logistic Regression and Naive Bayes Models


Key Observations:-


* Apply all the possible Tuning Techniques to Train the best Model for the given data. Select the final best Trained Model with your comments for selecting this Model.

List of different Models Applied to our Data:-

  1. LogisticRegression()
  2. KNeighborsClassifier()
  3. GaussianNB()
  4. SVC(kernel='rbf')
  5. SVC(kernel='linear')
  6. RandomForestClassifier()
  7. DecisionTreeClassifier()

Key Observations:-


6. Conclusion and Improvisation:

* Write your Conclusion on the Results.

* Detailed suggestions or improvements on quality, quantity, variety, velocity, veracity etc. of the data points collected by the bank to perform a better data analysis in future.

Closing Sentence:- The predictions made by our models will help bank X identify the potential customers who will convert, based on the historical dataset.

------------------------------------------------- End of Part TWO -----------------------------------------------------

--------------------- End of AIML MODULE PROJECT 2 ---------------------

------------------------------------------------------------------------------THANK YOU😊----------------------------------------------------------------------------------